Dedication:
I dedicate this to the times I longed to become a data analyst while I worked in non-analytic financial positions.
“I wasn’t hungry for money. I was hungry for success” -Raul M.
Enjoy
Prosper is advertised as Americas first marketplace lending platform, with over $9 billion in funded loans.
Prosper allows people to invest in each other in a way that is financially and socially rewarding. On Prosper, borrowers list loan requests between $2,000 and $35,000 and individual investors invest as little as $25 in each loan listing they select. Prosper handles the servicing of the loan on behalf of the matched borrowers and investors.
We explore the 2014 Prosper data containing 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information.
We want to bring in the following libraries:
knitr (General-purpose tool for dynamic report generation)
dpylyr (data manipulation library)
tidyr (data manipulation library)
ggplot2 (visualization library)
gridExtra (grid visualizations)
These libraries should sufficiently allow us to manipulate and explore the dataset.
Warning messages:
1: running command '"C:/Program Files/RStudio/bin/pandoc/pandoc" +RTS -K512m -RTS P4-EDA-in-R-ProsperLoanData.utf8.md --to html --from markdown+autolink_bare_uris+ascii_identifiers+tex_math_single_backslash --output pandoc15ac6b914298.html --smart --email-obfuscation none --self-contained --standalone --section-divs --template "C:\Users\rmald_000\Documents\R\win-library\3.3\rmarkdown\rmd\h\default.html" --no-highlight --variable highlightjs=1 --variable "theme:bootstrap" --include-in-header "C:\Users\RMALD_~1\AppData\Local\Temp\RtmpOeJqSg\rmarkdown-str15ac498b146f.html" --mathjax --variable "mathjax-url:https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" --variable code_folding=show --variable source_embed=P4-EDA-in-R-ProsperLoanData.Rmd --include-after-body "C:\Users\RMALD_~1\AppData\Local\Temp\RtmpOeJqSg\file15ac31da5692.html" --variable code_menu=1 --variable kable-scroll=1' had status 1
2: running command '"C:/Program Files/RStudio/bin/pandoc/pandoc" +RTS -K512m -RTS P4-EDA-in-R-ProsperLoanData.utf8.md --to html --from markdown+autolink_bare_uris+ascii_identifiers+tex_math_single_backslash --output pandoc15ac6e02a5f.html --smart --email-obfuscation none --self-contained --standalone --section-divs --template "C:\Users\rmald_000\Documents\R\win-library\3.3\rmarkdown\rmd\h\default.html" --no-highlight --variable highlightjs=1 --variable "theme:bootstrap" --include-in-header "C:\Users\RMALD_~1\AppData\Local\Temp\RtmpOeJqSg\rmarkdown-str15ac24a32279.html" --mathjax --variable "mathjax-url:https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" --variable code_folding=show --variable source_embed=P4-EDA-in-R-ProsperLoanData.Rmd --include-after-body "C:\Users\RMALD_~1\AppData\Local\Temp\RtmpOeJqSg\file15ac3ca244a2.html" --variable code_menu=1 --variable kable-scroll=1' had status 1
3: running command '"C:/Program Files/RStudio/bin/pandoc/pandoc" +RTS -K512m -RTS P4-EDA-in-R-ProsperLoanData.utf8.md --to html --from markdown+autolink_bare_uris+ascii_identifiers+tex_math_single_backslash --output pandoc15ac62722295.html --smart --email-obfuscation none --self-contained --standalone --section-divs --template "C:\Users\rmald_000\Documents\R\win-library\3.3\rmarkdown\rmd\h\default.html" --no-highlight --variable highlightjs=1 --variable "theme:bootstrap" --include-in-header "C:\Users\RMALD_~1\AppData\Local\Temp\RtmpOeJqSg\rmarkdown-str15ac92443c.html" --mathjax --variable "mathjax-url:https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" --variable code_folding=show --variable source_embed=P4-EDA-in-R-ProsperLoanData.Rmd --include-after-body "C:\Users\RMALD_~1\AppData\Local\Temp\RtmpOeJqSg\file15ac2e3548c8.html" --variable code_menu=1 --variable kable-scroll=1' had status 1
4: running command '"C:/Program Files/RStudio/bin/pandoc/pandoc" +RTS -K512m -RTS P4-EDA-in-R-ProsperLoanData.utf8.md --to html --from markdown+autolink_bare_uris+ascii_identifiers+tex_math_single_backslash --output pandoc15ac33da1809.html --smart --email-obfuscation none --self-contained --standalone --section-divs --template "C:\Users\rmald_000\Documents\R\win-library\3.3\rmarkdown\rmd\h\default.html" --no-highlight --variable highlightjs=1 --variable "theme:bootstrap" --include-in-header "C:\Users\RMALD_~1\AppData\Local\Temp\RtmpOeJqSg\rmarkdown-str15ac69b07cd5.html" --mathjax --variable "mathjax-url:https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" --variable code_folding=show --variable source_embed=P4-EDA-in-R-ProsperLoanData.Rmd --include-after-body "C:\Users\RMALD_~1\AppData\Local\Temp\RtmpOeJqSg\file15ac464d571c.html" --variable code_menu=1 --variable kable-scroll=1' had status 1
5: running command '"C:/Program Files/RStudio/bin/pandoc/pandoc" +RTS -K512m -RTS P4-EDA-in-R-ProsperLoanData.utf8.md --to html --from markdown+autolink_bare_uris+ascii_identifiers+tex_math_single_backslash --output pandoc15acdea2dca.html --smart --email-obfuscation none --self-contained --standalone --section-divs --template "C:\Users\rmald_000\Documents\R\win-library\3.3\rmarkdown\rmd\h\default.html" --no-highlight --variable highlightjs=1 --variable "theme:bootstrap" --include-in-header "C:\Users\RMALD_~1\AppData\Local\Temp\RtmpOeJqSg\rmarkdown-str15ac322b51a4.html" --mathjax --variable "mathjax-url:https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" --variable code_folding=show --variable source_embed=P4-EDA-in-R-ProsperLoanData.Rmd --include-after-body "C:\Users\RMALD_~1\AppData\Local\Temp\RtmpOeJqSg\file15ac7e777f6.html" --variable code_menu=1 --variable kable-scroll=1' had status 1
After importing the csv file “propserLoanData.csv,” we observe a glimpse of what the prosper data “df” will be below:
Observe that we only called two rows from an 81 variable dataset.
Additionally, we look at the feature types for the first 10 rows and first 3 columns
The folowing are the column names, number of rows, and number of columns respectively.
Column Names:
names(df)
[1] "ListingKey"
[2] "ListingNumber"
[3] "ListingCreationDate"
[4] "CreditGrade"
[5] "Term"
[6] "LoanStatus"
[7] "ClosedDate"
[8] "BorrowerAPR"
[9] "BorrowerRate"
[10] "LenderYield"
[11] "EstimatedEffectiveYield"
[12] "EstimatedLoss"
[13] "EstimatedReturn"
[14] "ProsperRating..numeric."
[15] "ProsperRating..Alpha."
[16] "ProsperScore"
[17] "ListingCategory..numeric."
[18] "BorrowerState"
[19] "Occupation"
[20] "EmploymentStatus"
[21] "EmploymentStatusDuration"
[22] "IsBorrowerHomeowner"
[23] "CurrentlyInGroup"
[24] "GroupKey"
[25] "DateCreditPulled"
[26] "CreditScoreRangeLower"
[27] "CreditScoreRangeUpper"
[28] "FirstRecordedCreditLine"
[29] "CurrentCreditLines"
[30] "OpenCreditLines"
[31] "TotalCreditLinespast7years"
[32] "OpenRevolvingAccounts"
[33] "OpenRevolvingMonthlyPayment"
[34] "InquiriesLast6Months"
[35] "TotalInquiries"
[36] "CurrentDelinquencies"
[37] "AmountDelinquent"
[38] "DelinquenciesLast7Years"
[39] "PublicRecordsLast10Years"
[40] "PublicRecordsLast12Months"
[41] "RevolvingCreditBalance"
[42] "BankcardUtilization"
[43] "AvailableBankcardCredit"
[44] "TotalTrades"
[45] "TradesNeverDelinquent..percentage."
[46] "TradesOpenedLast6Months"
[47] "DebtToIncomeRatio"
[48] "IncomeRange"
[49] "IncomeVerifiable"
[50] "StatedMonthlyIncome"
[51] "LoanKey"
[52] "TotalProsperLoans"
[53] "TotalProsperPaymentsBilled"
[54] "OnTimeProsperPayments"
[55] "ProsperPaymentsLessThanOneMonthLate"
[56] "ProsperPaymentsOneMonthPlusLate"
[57] "ProsperPrincipalBorrowed"
[58] "ProsperPrincipalOutstanding"
[59] "ScorexChangeAtTimeOfListing"
[60] "LoanCurrentDaysDelinquent"
[61] "LoanFirstDefaultedCycleNumber"
[62] "LoanMonthsSinceOrigination"
[63] "LoanNumber"
[64] "LoanOriginalAmount"
[65] "LoanOriginationDate"
[66] "LoanOriginationQuarter"
[67] "MemberKey"
[68] "MonthlyLoanPayment"
[69] "LP_CustomerPayments"
[70] "LP_CustomerPrincipalPayments"
[71] "LP_InterestandFees"
[72] "LP_ServiceFees"
[73] "LP_CollectionFees"
[74] "LP_GrossPrincipalLoss"
[75] "LP_NetPrincipalLoss"
[76] "LP_NonPrincipalRecoverypayments"
[77] "PercentFunded"
[78] "Recommendations"
[79] "InvestmentFromFriendsCount"
[80] "InvestmentFromFriendsAmount"
[81] "Investors"
Number of rows: 113937
Number of columns 81
This 113,937 by 81 loan data is gargantuate. We only outputed the first three rows, above. Ingesting this dataset, we would undergo heavy time costs to our analysis. To improve computing performance and minimize time on project, for the sake of this analysis, we create several subsets from the Prosper loan dataframe “df.”
Note: What’s that “NA” entry that frequently comes up?
Notice how we didn’t introduce the data as “tidy”“,”clean“”, or even mention how we extracted the data. I.e., We didn’t introduce any audited data, in which could imply our data has not been maintained or corrected to analysis standards.
We observe the total amount of null values in the Prosper dataframe df 64
We have 1078519 total missing values in our dataset.
Below is a breakdown of the total null values per column in the df dataframe:
The mean NA count for all columns is 0.7901235
13315.05 is the estimated average amount of null values in some column. What percentage of our df dataframe is this?
Dividing the number of rows of the dataframe by the mean of the columns, we observe 6.934739910^{-6} entries missing.
We have two choices:
Should we delete customer data(rows) where some total null values of a given column are significantly high, such as +1000 points?
Fill in the missing non-null value entries for each column?
Say you purchase a meal for $14.99. You eat a majority of it, but then become full. Do you eat the food, or leave it be?
This is subjective, though I feel I would eat the food because I am lower income individual that worked hard for every single dollar–and I certainly don’t want food to go to waste(that is, I don’t want my money to go to waste)
Simlarly, this estimated 11% loss of data appears to be insignificant, but preservation of such details is important in our analysis. Obtaining the data is a great ordeal, and we should preserve the, once was diffcult, capute of information.
So, for the sake of minimizing data loss, we fill in the null values of the dataframe.
We will use the fill function from tidyr to conveniently fill in missing data with the last recored non-null value in each category.
Example: if row 81 column 8 had an entered value 3.14159, and had several consecutive null values after, the fill function will fill in the null values with 3.14159. (Note: This filling in of data induces bias. However, we preserve data. A detailed discussion will be made at the ending of this report)
After filling in the data, we again observe the total amount and per-category null values in the Prosper dataframe df was 64.
We have a total of 16 missing values. It’s interesting that there is still some residue of null values. However, we will not pursue this interesting route, for the sake of simplicity in this project.
Note: If one were to go more in depth, one would observe these remaining outlier variables, reiterate, and then repeat process.
Recalling the two solutions for dealing with missing data, we decide to simply delete these remaining 16 values. Though these 16 persons with some missing value can have an influence in our analysis without them, we assume their impact to the analysis to be unlikely.
Below is the fraction at which they represent our dataset 1.404284810^{-4}
This removal is a small, small, risk we are taking.
After removing the residual nulls,
We have 0 Null values remaining
After removing the 16 customer information, rows, we proceed to subsetting our data.
We generate two subset data from our Prosper loan dataset.
These two subsets are based on the idea of perspectives of a Prosper loan.
How does one Prosper customer interact/feel towards the loan structure?
How does one Prosper investor interact/feel towards the loan structure?
The Prosper customer subset “loan_customer_perspective” will contain the following variables:
The Credit Grade
Loan Status
Borrower Rate
Estimated Loss
Prosper Rating
Available Bank Card-Credit
Below is the first few rows of the customer perspective dataset.
Furthemore, the following is table statistics for this dataset:
We observe a frequency table for Credit Grade and Loan Status, and numerical calculations for the remainder. I.e., we see counts for the frist two features, but see mean, range, etc in the other variables.
Recall factor sets. The first features are factor variables, while the remainder are numerical.
In regards to the statistical information, the minimum, 1st quantile, median, and 3rd quantile are 0. However, mean is non-zero value. We interpret this as the Current Delinquencies feature have several zeros.
Morover, we observe “other” classification for Credit Grade and Loan Status features. We explore these observations through visual analysis in the “Prosper Customer Analysis” section.
The Prosper investor subset “loan_investor_perspective” will contain the following variables:
Employment Status
Bank Card Utilization
Investors
LP_Collection Fees
Prosper Score
IsBorrowerHomeowner
CurrentDelinquencies
Below is the first few rows of the loan_customer_perspective dataset
We observe a frequency table for the “Employment Status” and “Borrower Homeowner?” features, and numerical calculations for the remainder. I.e., we see counts for the first two features, but see mean, range, etc in the other variables.
Recall factor sets. The first features are factor variables, while the remainder are numerical.
In regards to the statistical information, the minimum, 1st quantile, median, and 3rd quantile are 0. However, mean is non-zero value. We interpret this as the Current Delinquencies feature have several zeros.
** Motviation **
If we were a potential Prosper customer/borrower, what considerations should we factor in to be successful in paying off loan. I.e., If we were a borrower, what characteristics should I emulate to be well-off with Prosper (even considering I may want to borrow with them again, in the future)
Again, we are utilizing the loan_customer_perspective dataset. The following are the dataset variables available to us:
Credit Grade (Factor Variable)
Borrower Rate (Factor Variable)
Prosper Score (Num variable)
Loan Status (Num variable)
Estimated Loss (Num variable)
Current Delinquences (int variable)
The size of our data set is 40000 rows x 6 columns.
This is the count of the status of each Prosper customer loan. Of course, we observe higher completetions, current, and charge offs than deliquent cases–that’s good!
We observe a large amount of empty values with the Credit Grade information. This is alarming. We consider a few cases. Was there data loss in moving this information? Did Prosper take higher initiative to focus on this information in 2014? Did we mess up during the data cleaning process?
We observe our estimated loss of the original principle from chargeoffs ranges in the 10% region. From the following summary, we observe that the average estimate loss was actually 8.032%
The maximum value at which some estimated loss occured was at 36.6%. This is alarming. However, we should observe if this is a typical ceiling scenario in loss from delinquency. As a customer, we don’t want to default with a high loan. (As an investor, we do not want to see such loss!)
We observe that the significant 36.6% loss was, in fact, part of a cluster of outliers, as seen by the dots of the above right diagram. These outliers are abnormal cases.
It’s good to note that at most 20%, the end of the vertical line on the right graph, of investment is lost. For a customers taking out a loan after July 2009, there likelyhood of defaulting is in 10-20 percent range.
However, this is no excuse from the investor perspective.
Simular to our Estimated loss and Loan Status, we observe a lack in delinquent customers in our sample of the Prosper customer information. This conclusion can be seen from the large amount of 0 valued current delinquencies.
We now consider bi-variate cases. We display visualizations for these cases using scatterplots.
From the red stars, We observe most of the customer loan data centralized at the Borrower Rate of 0.25, 25% in most cases.
Though, we can just utilize the below statistical information to resolve this question.
So in fact, the average Borrower Rate for Prosper Loans is 0.193279
A little off, but close. This unexpected average could be weighed down by lower values. But what happens if we find the median borrower rate?
The median borrower rate 0.184We will reference this value as our central focal point of Borrower rate values.
Another observation is the concentration of dark colored data points in the non-past due and non-default cases. Again, this is because our data capture had no significant past due or default cases.
Note: There is a big concentration of chargeoff loans. One may ask why there is a major severity in this category. I would say that this is a recorded tally throughout Prosper’s existense, and thus is why we see a large amount of charge offs, yet see a lack of Prosper customers being past due/defaulted.
Again, There is a large concentration of dark colored data points in the non-past due and non-default cases. Again, this is because our data capture had no significant past due or default cases.
Note: There is a big concentration of chargeoff loans. One may ask why there is a major severity in this category. I would say that this is a recorded tally throughout Prosper’s existense, and thus is why we see a large amount of charge offs, yet see a lack of Prosper customers being past due/defaulted.
Second Note: Notice there is a darker concentration for the outliers in Charged Off and Complete categories, compared to the Borrow Rate vs Loan Status information. I am not leading in to draw connections to the two scenarios. However, I am addressing that this situation has something interesting. When an estimated loss outlier occurs, it ranges into the 30% Estimated Loss region.
One would anticipate Prosper Scores would be higher if a loan status was in good standing (current, completed, etc.) However, we observe a dark concentration in Charged off, completed, and current categories in the entire Prosper Score range, and not just the top 10% of scores.
Moreover, we observe that charge off has quite a few Prosper Scores of 7 and up. This reason could, again, lie in the fact that this category is an overall collection of Charge Off information, containing false positives.
Notice as the borrower rate increase, the estimated loss percentage increases as well. One may believe that there is some association between borrower rate and estimated loss.
We utilize the following regression model to determine if there exists some significant connection between borrower rate and estimated loss. $ EstimatedLoss = 0.4389099{(BorrowerRate)} + -0.0045159$
loan_customer_perspective.md1 <- lm(EstimatedLoss ~ BorrowerRate,
data = loan_customer_perspective)
summary(loan_customer_perspective.md1)
From the summary information from above, we determine that this model is significant under the assumption there is a 0.001% chance of the model being random.
Note: Let’s assume some connection between two variables is due to chance, say 5%. If our variables were statistically significant, two stars would appear to the far right of the table of variables above. However, we see three stars. If we look at the Signif. codes, we see that three stars is associated with 0.001. Hence, we utilize 0.001 as our \(\alpha\) value in this situation.
Additional Note: However, we should still be skeptical of our model. Though there is a high Adjusted R^2 value of .4926 and significance between the two variables, we need to ensure no bias, error, or even lurking variables affect this model.
Between Current Delinquencies and Prosper Score, we do not see a connection. And this should be true. Delinquencies happen by chance, and not by some explicit finding/observation in our data. > Note: However, there may be a case where some connection between Current Delinquencies can be predicted, from associating it with the remaining 70+ variables in the original Prosper Loan dataset.
However, let’s determine some association by factoring in the Loan Status feature. Below is a similar graph from the previous visualization. But, the only differnce is factoring in the Prosper customer loan status.
There appears to be no significant findings from considering in another variable. I.e., The above visualization is just a similar understanding to that of the original example.
It’s good to know that we tried, right?
Now we switch things up. We consider the estimated loss vs borrower rate visualization from before. The following is a reminder of the scatterplot:
This time, we factor in Prosper Score in our visualization, like so:
Again, notice as the borrower rate increase, the estimated loss percentage increases as well. This time, as we increase our estimated loss, meanwhile observe borrower rate increasing too, we see that Prosper scores tend to decrease as the degree of estimated loss occurs.
This is trivial. In hindsight, we categorize people that negatively effect us to be “bad” people; it’s human nature.
To this opinion, we extend the previous linear regression from before by factoring in Prosper Score.
$ EstimatedLoss = 0.3076{(BorrowerRate)} -0.008560(ProsperScore) + 0.07183$
loan_customer_perspective.md2 <- lm(EstimatedLoss ~ BorrowerRate + ProsperScore, data = loan_customer_perspective)
summary(loan_customer_perspective.md2)
From the summary information from above, we determine that this model is significant under the assumption there is a 0.001% chance of the model being random.
Note: Let’s assume some connection between two variables is due to chance, say 5%. If our variables were statistically significant, two stars would appear to the far right of the table of variables above. However, we see three stars. If we look at the Signif. codes, we see that three stars is associated with 0.001. Hence, we utilize 0.001 as our \(\alpha\) value in this situation.
** Motviation ** Now, we look into the Prosper Investor perspective. Usually Investors invest in some type of portfolio of Prosper customer loans to seek investment return.
With modern day data-driven companies like Prosper, we suspect that the investor acknowledges some default rate in there investment, Prosper service fees taken out, and some return is made as well.
However, what information can we extract from the Prosper data “df” such that an investor knows what financial factors to consider in his/her investment.
Note: The above conclusion can be made from such financial institutions like Lending Club
Again, the Prosper investor subset “loan_investor_perspective” will contain the following variables:
Employment Status
Bank Card Utilization
Investors
LP_Collection Fees
Prosper Score
IsBorrowerHomeowner
CurrentDelinquencies
names(loan_investor_perspective)
Most of the investor contributions to a Prosper loan lie near 100. The case for this is empy values “” being entered into our data. If we were to observe this in full, we might anticipate a more normalized histogram.
Taking this to a different scale, specifically the log scale, we observe the following:
We confirm the huge 0 value spiked of investors from above. Disregarding this observation, we also observe that there is a normal count of investors near the middle region of the range 10-1000, as seen below
Typically homeowners have some pre-existing loan with a financial institution. Is there a case for Prosper customers where investors would favor non-homeowners or homeowner?
We observe there is no bias towards being or not being a homeowner, when attempting to borrow from Prosper. As a customer, this is good.
Moreover, we observe that Prosper investors are investing into more randomized packages.
Note: Of course, due to discrimination regulations in financial insitutions, we should suspect that there exists no bias in this case. Maybe if the regulations were not imposed, would be see the opposite sceneario?
We observe a large amount of non-bank card utlizations records. That is alarming. However, maybe a borrower does not have a bank card, and they are simply borrowing with the intent of not opening a card or even not being permitted to hold a card.
If Prosper has a mission to get into the riskier market of underprivileged person under financial constraints, we can conlude why there exits a large amount of non-bank card users.
Another observation is just the ever so climbing bank card utilization as we look along the x-axis.
Lastly, noticed the spikes in our observation. We first thought some cyclic pattern occurred from some unkown variable or just assume “seasonality.” However, let’s consider another thought.
It could be the case that reports from some financial institutions realease customer information in a more friendly manner. I.e., maybe the input of customer bank card utilization is more on a list of options compared to typing in the full bank card utilization rate.
Again, we observe a normal distribution among Prosper investors and what they invest in. I.e., we are noticing that investors are no selecting loans that guarantee success.
Let’s look at another Investor observation.
If I was an investor, am I able decipher and choose loans based on if a borrower is a homeowner?
Again, we observe an investor does not choose borrower loans. There is no significant finding that an investor decisively invests into borrowers based on being a homeowner.
What is the case if an investor wants to avoid borrowers with delinquencies?
We observe that most investors end up investing into Prosper loans with some type of delinquency. We re-scale the visualization before concluding that investors, based on the graph, prefer Prosper loans with current delinquencies.
We observe that most investor contributions do not have a significant outcome of low delinquency situations. Of course, there is a more likely possibility investing into Prosper loans, and having no delinquent borrowers in return.
We are interested in seeing if investors would be more inclined to investing into loans where a borrower has bank card usage.
We observe some rectangular cluster in between 0 and 1. Thereafter, there just seem to be several outliers. This 0-1 range is the percentage of usage for borrowers. Because there is no visual pattern being displayed, we observe that, again, investors are not strategically investing.
We observe from the overall perspective that there is
no concrete leaning to avoid bad Prosper loans. As a Prosper borrower, we see from the following image that there exists investors investing in delinquent loans. As an investor, they can now observe they are undergoing risk in investing into loans.
Even in the case of Prosper Score and being a homeowner, we observe no investor bias towards some spectrum, as seen below
Out of curiosity, is there at least some abnormal pattern within Prosper lending operations?
The below graph is an indication that there exists no abnormal pattern. Moreover, we see that there is a pattern between estimated loss and borrower rate. Of course, that would be suspected in Prosper operations. We would like to track the bad accounts/borrowers. However, Prosper does not suspiciously categorize individuals from other discriminative occures.
Meanwhile observe borrower rate increasing too, we see that Prosper scores tend to decrease as the degree of estimated loss occurs.
This is trivial. In hindsight, we categorize people that negatively effect us to be “bad” people; it’s human nature. We categorize persons on some scale of their consequential patterns against the Prosper lending operations.
We conclude that there is no abnormal pattern between Prosper borrowers and Prosper operations. If I was a potential Prosper customer, I would like to take out a loan given the fact of the lack of bias in what type of customers they want.
As an investor, we saw that they invest randomly. That is, the investor does not choose what type of loans they invest in. And that is normal.
If you saw the Lending Club video earlier, you would see that investors agree to invest to some package of borrower. This package has calculated risk of 10%. Of course, a financial institution abides to that with specialized insights through data.
So, we observe an investor chooses to invest, but not which borrower loans to invest in.
The Prosper operations work according to banking regulations. Moreover, we observe classification and statistical findings about how to be a good Prosper customer and how the Prosper operations benefit prospective investors.
During this process, I encountered quite a few thoughts in regards how I can improve either the documentation or analysis.
First, the difficulties I initially ran into the analysis was computing power, a gargantuate amount of variables, and just logistics of resolving empty entries. Though I ran into these issues, I feel I succeeded in resolving NA values, partitioning the data for sake of computing power, and just performing more than just a data visualization analysis.
This analysis could be enriched with more statistical findings and just proper resolution of filling in blank-spaced entries. Moreover, the addition to feature engineering would have been beneficial in the initial process.
Ending some of my thoughts, I feel I could have: 1. Not induce entries into NA value entries so quickly
Exploration into why there were 16 null values remaining, after our data-filling process
Subsetted the original “df” data in a more strategic manner
ListingKey: Unique key for each listing, same value as the ‘key’ used in the listing object in the API.
ListingNumber: The number that uniquely identifies the listing to the public as displayed on the website.
ListingCreationDate: The date the listing was created.
CreditGrade: The Credit rating that was assigned at the time the listing went live. Applicable for listings pre-2009 period and will only be populated for those listings.
Term: The length of the loan expressed in months.
LoanStatus: The current status of the loan: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, PastDue. The PastDue status will be accompanied by a delinquency bucket.
ClosedDate: Closed date is applicable for Cancelled, Completed, Chargedoff and Defaulted loan statuses.
BorrowerAPR: The Borrower’s Annual Percentage Rate (APR) for the loan.
BorrowerRate: The Borrower’s interest rate for this loan.
LenderYield: The Lender yield on the loan. Lender yield is equal to the interest rate on the loan less the servicing fee.
EstimatedLoss: Estimated loss is the estimated principal loss on charge-offs. Applicable for loans originated after July 2009.
EstimatedReturn: The estimated return assigned to the listing at the time it was created. Estimated return is the difference between the Estimated Effective Yield and the Estimated Loss Rate. Applicable for loans originated after July 2009.
ProsperRating (numeric): The Prosper Rating assigned at the time the listing was created: 0 - N/A, 1 - HR, 2 - E, 3 - D, 4 - C, 5 - B, 6 - A, 7 - AA. Applicable for loans originated after July 2009.
ProsperRating (Alpha): The Prosper Rating assigned at the time the listing was created between AA - HR. Applicable for loans originated after July 2009.
ProsperScore: A custom risk score built using historical Prosper data. The score ranges from 1-10, with 10 being the best, or lowest risk score. Applicable for loans originated after July 2009.
ListingCategory: The category of the listing that the borrower selected when posting their listing: 0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7- Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans
BorrowerState: The two letter abbreviation of the state of the address of the borrower at the time the Listing was created.
Occupation: The Occupation selected by the Borrower at the time they created the listing.
EmploymentStatus: The employment status of the borrower at the time they posted the listing.
EmploymentStatusDuration: The length in months of the employment status at the time the listing was created.
IsBorrowerHomeowner: A Borrower will be classified as a homowner if they have a mortgage on their credit profile or provide documentation confirming they are a homeowner.
CurrentlyInGroup: Specifies whether or not the Borrower was in a group at the time the listing was created.
GroupKey: The Key of the group in which the Borrower is a member of. Value will be null if the borrower does not have a group affiliation.
DateCreditPulled: The date the credit profile was pulled.
CreditScoreRangeLower: The lower value representing the range of the borrower’s credit score as provided by a consumer credit rating agency.
CreditScoreRangeUpper: The upper value representing the range of the borrower’s credit score as provided by a consumer credit rating agency.
FirstRecordedCreditLine: The date the first credit line was opened.
CurrentCreditLines: Number of current credit lines at the time the credit profile was pulled.
OpenCreditLines: Number of open credit lines at the time the credit profile was pulled.
TotalCreditLinespast7years: Number of credit lines in the past seven years at the time the credit profile was pulled.
OpenRevolvingAccounts: Number of open revolving accounts at the time the credit profile was pulled.
OpenRevolvingMonthlyPayment: Monthly payment on revolving accounts at the time the credit profile was pulled.
InquiriesLast6Months: Number of inquiries in the past six months at the time the credit profile was pulled.
TotalInquiries: Total number of inquiries at the time the credit profile was pulled.
CurrentDelinquencies: Number of accounts delinquent at the time the credit profile was pulled.
AmountDelinquent: Dollars delinquent at the time the credit profile was pulled.
DelinquenciesLast7Years: Number of delinquencies in the past 7 years at the time the credit profile was pulled.
PublicRecordsLast10Years: Number of public records in the past 10 years at the time the credit profile was pulled.
PublicRecordsLast12Months: Number of public records in the past 12 months at the time the credit profile was pulled.
RevolvingCreditBalance: Dollars of revolving credit at the time the credit profile was pulled.
BankcardUtilization: The percentage of available revolving credit that is utilized at the time the credit profile was pulled.
AvailableBankcardCredit: The total available credit via bank card at the time the credit profile was pulled.
TotalTrades: Number of trade lines ever opened at the time the credit profile was pulled.
TradesNeverDelinquent: Number of trades that have never been delinquent at the time the credit profile was pulled.
TradesOpenedLast6Months: Number of trades opened in the last 6 months at the time the credit profile was pulled.
DebtToIncomeRatio: The debt to income ratio of the borrower at the time the credit profile was pulled. This value is Null if the debt to income ratio is not available. This value is capped at 10.01 (any debt to income ratio larger than 1000% will be returned as 1001%).
IncomeRange: The income range of the borrower at the time the listing was created.
IncomeVerifiable: The borrower indicated they have the required documentation to support their income.
StatedMonthlyIncome: The monthly income the borrower stated at the time the listing was created.
LoanKey: Unique key for each loan. This is the same key that is used in the API.
TotalProsperLoans: Number of Prosper loans the borrower at the time they created this listing. This value will be null if the borrower had no prior loans.
TotalProsperPaymentsBilled: Number of on time payments the borrower made on Prosper loans at the time they created this listing. This value will be null if the borrower had no prior loans.
OnTimeProsperPayments: Number of on time payments the borrower had made on Prosper loans at the time they created this listing. This value will be null if the borrower has no prior loans.
ProsperPaymentsLessThanOneMonthLate: Number of payments the borrower made on Prosper loans that were less than one month late at the time they created this listing. This value will be null if the borrower had no prior loans.
ProsperPaymentsOneMonthPlusLate: Number of payments the borrower made on Prosper loans that were greater than one month late at the time they created this listing. This value will be null if the borrower had no prior loans.
ProsperPrincipalBorrowed: Total principal borrowed on Prosper loans at the time the listing was created. This value will be null if the borrower had no prior loans.
ProsperPrincipalOutstanding: Principal outstanding on Prosper loans at the time the listing was created. This value will be null if the borrower had no prior loans.
ScorexChangeAtTimeOfListing: Borrower’s credit score change at the time the credit profile was pulled. This will be the change relative to the borrower’s last Prosper loan. This value will be null if the borrower had no prior loans.
LoanCurrentDaysDelinquent: The number of days delinquent.
LoanFirstDefaultedCycleNumber: The cycle the loan was charged off. If the loan has not charged off the value will be null.
LoanMonthsSinceOrigination: Number of months since the loan originated.
LoanNumber: Unique numeric value associated with the loan.
LoanOriginalAmount: The origination amount of the loan.
LoanOriginationDate: The date the loan was originated.
LoanOriginationQuarter: The quarter in which the loan was originated.
MonthlyLoanPayment The scheduled monthly loan payment.
LP_CustomerPayments: Pre charge-off cumulative gross payments made by the borrower on the loan. If the loan has charged off, this value will exclude any recoveries.
LP_CustomerPrincipalPayments: Pre charge-off cumulative principal payments made by the borrower on the loan. If the loan has charged off, this value will exclude any recoveries.
LP_InterestandFees: Pre charge-off cumulative interest and fees paid by the borrower. If the loan has charged off, this value will exclude any recoveries.
LP_ServiceFees: Cumulative service fees paid by the investors who have invested in the loan.
LP_CollectionFees: Cumulative collection fees paid by the investors who have invested in the loan.
LP_GrossPrincipalLoss: The gross charged off amount of the loan.
LP_NetPrincipalLoss: The principal that remains uncollected after any recoveries.
LP_NonPrincipalRecoverypayments: The interest and fee component of any recovery payments. The current payment policy applies payments in the following order: Fees, interest, principal.
PercentFunded: Percent the listing was funded. Recommendations Number of recommendations the borrower had at the time the listing was created.
InvestmentFromFriendsCount: Number of friends that made an investment in the loan.
InvestmentFromFriendsAmount: Dollar amount of investments that were made by friends.
Investors: The number of investors that funded the loan.